library(ggplot2)
df <- subset(insurance, select = -c(region))
head(df, 4)
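The notebook assumes a data frame named `insurance` is already in the workspace. A minimal loading step might look like the following; the file name `insurance.csv` is an assumption, not something stated in the notebook.

```r
# Hypothetical loader; the path/file name "insurance.csv" is an assumption.
load_insurance <- function(path = "insurance.csv") {
  insurance <- read.csv(path, stringsAsFactors = FALSE)
  # Drop the region column, mirroring the subset() call above
  subset(insurance, select = -c(region))
}
# df <- load_insurance()
```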
The given data set provides the age, sex, BMI (body mass index, an indicator of whether weight is relatively high or low for a given height), and number of children covered for each individual, along with the individual medical costs billed by health insurance.
Understand Data: data source identification and understanding:
The data is collected from Kaggle (Medical Cost Personal Datasets) link: https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download The data set gives insight into how health and other factors can have a major impact on the insurance a person buys, and into what an insurance company is likely to pay for an individual's medical expenses in the future. This study can also help identify how to lower charges by selecting clients based on their lifestyle (BMI, age, smoking status, etc.).
## UNDERSTANDING DATA
Describe your data in terms of the following aspects: concept of learning: the data set has five independent input parameters and one dependent parameter, so we can treat it as a problem that can be solved using linear regression with multiple variables.
Data Attributes:
colnames(df)
[1] "age" "sex" "bmi" "children" "smoker" "charges"
The given data set has five attributes, namely age, sex, bmi, children, and smoker; the target variable is charges.
age: age of the primary beneficiary
sex: insurance contractor gender (female, male)
bmi: body mass index (kg/m^2), an objective index of body weight relative to height; the ideal range is 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: whether the primary beneficiary smokes
charges: individual medical costs billed by health insurance
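Since `sex` and `smoker` are stored as character vectors (see the `summary()` output below), converting them to factors with explicit levels makes plots and model coefficients more predictable. A small helper, sketched under the assumption that the columns are spelled as above:

```r
# Convert the two categorical columns to factors with explicit level order.
to_factors <- function(d) {
  d$sex    <- factor(d$sex,    levels = c("female", "male"))
  d$smoker <- factor(d$smoker, levels = c("no", "yes"))
  d
}
# df <- to_factors(df)   # optional; lm() also handles character columns
```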
head(df, 1)
In this instance of the data, the primary beneficiary is a 19-year-old female with a BMI of 27.9 and 0 children/dependents. She smokes, and her charges were $16,884.92.
Counting the number of occurrences of each category in the data frame:
table(df['sex'])
sex
female male
662 676
table(df['children'])
children
0 1 2 3 4 5
574 324 240 157 25 18
table(df['smoker'])
smoker
no yes
1064 274
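Raw counts can be turned into shares with `prop.table`, which makes the class imbalance in `smoker` easier to read (roughly 80% non-smokers, from the counts above). A small helper:

```r
# Share of each category, rounded to three decimals.
category_shares <- function(x) round(prop.table(table(x)), 3)
# category_shares(df$smoker) would give about: no 0.795, yes 0.205
```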
ggplot(df, aes(x=sex)) + geom_bar()
The graph shows that the dataset contains almost equal numbers of male and female data points.
ggplot(df, aes(x = children)) +geom_bar()
The categorical counts show that most people had 0 children covered by the insurance, followed in order by 1, 2, 3, 4, and 5 children.
ggplot(df, aes(x = smoker)) +geom_bar()
The counts show that most of the people who opted for the insurance were non-smokers.
hist(df$charges)
The histogram above shows that most of the customers have charges in the range of $0 to $15,000. Next we look at the data distribution with a log transformation.
hist(log(df$charges))
After taking the log of charges, we can clearly see that the distribution becomes fairly Gaussian.
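The visual impression can be backed with a number. The following is a generic sample-skewness helper (not from the notebook); raw charges should give a clearly positive value and log(charges) a value much closer to zero.

```r
# Sample skewness: mean of the cubed standardized values.
# Positive => right-skewed (long tail of high charges).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}
# compare skewness(df$charges) with skewness(log(df$charges))
```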
plot(df$bmi, df$age ,xlab="bmi of customer", ylab="age of customer", pch=19,col = "orange")
plot(df$bmi, df$charges ,xlab="bmi of customer", ylab="charges of customer", pch=19,col = "orange")
Drawing a scatter plot of bmi against log(charges):
plot(df$bmi, log(df$charges) ,xlab="bmi of customer", ylab="log of charges of customer", pch=19,col = "orange")
While for most customers with a BMI in the range 0 to 30 the charges typically stayed in the range of $0 to $20,000, for some customers with a BMI in the range 30 to 50 the charges rose to between $40,000 and over $60,000.
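The two charge bands in this scatter plot are plausibly explained by smoker status; colouring the points by `smoker` (a hypothesis to check, not a conclusion stated above) makes this visible. A base-R sketch:

```r
# Colour points red for smokers, grey for non-smokers;
# returns the colour vector invisibly so it can be inspected.
plot_charges_by_smoker <- function(d) {
  cols <- ifelse(d$smoker == "yes", "red", "grey40")
  plot(d$bmi, d$charges, col = cols, pch = 19,
       xlab = "bmi of customer", ylab = "charges of customer")
  legend("topleft", legend = c("smoker", "non-smoker"),
         col = c("red", "grey40"), pch = 19)
  invisible(cols)
}
# plot_charges_by_smoker(df)
```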
summary(df)
age sex bmi children smoker
Min. :18.00 Length:1338 Min. :15.96 Min. :0.000 Length:1338
1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000 Class :character
Median :39.00 Mode :character Median :30.40 Median :1.000 Mode :character
Mean :39.21 Mean :30.66 Mean :1.095
3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
Max. :64.00 Max. :53.13 Max. :5.000
charges
Min. : 1122
1st Qu.: 4740
Median : 9382
Mean :13270
3rd Qu.:16640
Max. :63770
boxplot(df$charges)
Analysing the boxplot of charges again shows where most of the data is concentrated.
boxplot(df$children)
Analysing the boxplot, we see that most people have 0-2 children covered under their insurance.
boxplot(df$bmi)
Analysing the density of charges:
plot(density(df$charges))
Plotting the density of log(charges):
plot(density(log(df$charges)))
Plotting charges by age:
plot(df$age, df$charges ,xlab="age of customer", ylab="charges of customer", pch=19, col = "orange")
Plotting age against the log of charges for each customer:
plot(df$age, log(df$charges) ,xlab="age of customer", ylab="log of charges of customer", pch=19, col = "orange")
cor(df$bmi, df$charges, method = "pearson", use = "complete.obs")
[1] 0.198341
Using Pearson's coefficient, bmi and charges are only weakly correlated (0.198).
cor(df$age, df$charges, method = "spearman", use = "complete.obs")
[1] 0.5343921
Using Spearman's coefficient, the correlation between age and charges is 0.534, so they are moderately correlated.
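Instead of computing correlations one pair at a time, all numeric columns can be correlated at once. A small helper (column names assumed as above):

```r
# Pairwise Pearson correlation matrix over the numeric columns only.
numeric_cors <- function(d) {
  num <- d[sapply(d, is.numeric)]
  round(cor(num, use = "complete.obs"), 3)
}
# numeric_cors(df) shows age, bmi, children and charges in one matrix
```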
x <- df$age
y <- df$charges
# Plot with main and axis titles; pch = 19 gives solid points,
# frame = FALSE removes the box around the plot
plot(x, y, main = "Charges vs. age",
xlab = "age", ylab = "charges",
pch = 19, frame = FALSE)
# Add the least-squares regression line
abline(lm(y ~ x), col = "red")
The red line fits the data poorly; the data also looks too scattered to support a linear relationship between age and charges. Next we plot regression lines to fit the data for two attributes at a time.
Plotting a regression line between age and log(charges):
x <- df$age
y <- log(df$charges)
# Plot with main and axis titles; pch = 19 gives solid points,
# frame = FALSE removes the box around the plot
plot(x, y, main = "log(charges) vs. age",
xlab = "age", ylab = "log of charges",
pch = 19, frame = FALSE)
# Add the least-squares regression line
abline(lm(y ~ x), col = "red")
cor(df$age, df$charges, method = "pearson", use = "complete.obs")
[1] 0.2990082
This tells us that age and charges are only slightly related, with a Pearson correlation of 0.299.
x <- df$bmi
y <- df$charges
# Plot with main and axis titles; pch = 19 gives solid points,
# frame = FALSE removes the box around the plot
plot(x, y, main = "Charges vs. bmi",
xlab = "bmi", ylab = "charges",
pch = 19, frame = FALSE)
# Add the least-squares regression line
abline(lm(y ~ x), col = "red")
Plotting a graph between bmi and log(charges):
x <- df$bmi
y <- log(df$charges)
# Plot with main and axis titles; pch = 19 gives solid points,
# frame = FALSE removes the box around the plot
plot(x, y, main = "log(charges) vs. bmi",
xlab = "bmi", ylab = "log of charges",
pch = 19, frame = FALSE)
# Add the least-squares regression line
abline(lm(y ~ x), col = "red")
The red line fits the data poorly; the data also looks too scattered to support a linear relationship between bmi and charges.
## Solving multiple linear regression on the data set
model <- lm(charges ~ age + sex + bmi + children + smoker, data = df)
summary(model)
Call:
lm(formula = charges ~ age + sex + bmi + children + smoker, data = df)
Residuals:
Min 1Q Median 3Q Max
-11837.2 -2916.7 -994.2 1375.3 29565.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12052.46 951.26 -12.670 < 2e-16 ***
age 257.73 11.90 21.651 < 2e-16 ***
sexmale -128.64 333.36 -0.386 0.699641
bmi 322.36 27.42 11.757 < 2e-16 ***
children 474.41 137.86 3.441 0.000597 ***
smokeryes 23823.39 412.52 57.750 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6070 on 1332 degrees of freedom
Multiple R-squared: 0.7497, Adjusted R-squared: 0.7488
F-statistic: 798 on 5 and 1332 DF, p-value: < 2.2e-16
summary(model)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12052.4620 951.26043 -12.6699919 8.099035e-35
age 257.7350 11.90389 21.6513323 2.593679e-89
sexmale -128.6399 333.36051 -0.3858881 6.996412e-01
bmi 322.3642 27.41860 11.7571362 1.954711e-30
children 474.4111 137.85580 3.4413577 5.967197e-04
smokeryes 23823.3925 412.52338 57.7504052 0.000000e+00
sigma(model)/mean(df$charges)
[1] 0.4573875
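The residual standard error is about 46% of the mean charges, which is large. Base R's `plot()` method for `lm` objects produces four standard diagnostic panels (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage); a small wrapper:

```r
# Draw the four standard lm diagnostic plots in a 2x2 grid,
# restoring the previous par() settings on exit.
diagnose <- function(m) {
  op <- par(mfrow = c(2, 2))
  on.exit(par(op))
  plot(m)
}
# diagnose(model)
```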
confint(model)
2.5 % 97.5 %
(Intercept) -13918.5939 -10186.3301
age 234.3826 281.0874
sexmale -782.6087 525.3290
bmi 268.5759 376.1526
children 203.9730 744.8493
smokeryes 23014.1262 24632.6589
# the fitted equation is: charges = -12052.46 + 257.74*age - 128.64*sexmale + 322.36*bmi + 474.41*children + 23823.39*smokeryes
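The written-out equation can be checked against `predict()`: with `sexmale` and `smokeryes` as 0/1 indicators, the manual sum of coefficient times value must match the model's prediction. A sketch (the example customer values are made up):

```r
# Manually evaluate the fitted equation for one customer;
# sexmale and smokeryes are 0/1 indicator values.
check_equation <- function(m, age, sexmale, bmi, children, smokeryes) {
  b <- coef(m)
  unname(b["(Intercept)"] + b["age"] * age + b["sexmale"] * sexmale +
         b["bmi"] * bmi + b["children"] * children +
         b["smokeryes"] * smokeryes)
}
# check_equation(model, 30, 1, 28, 1, 0) should equal
# predict(model, data.frame(age = 30, sex = "male", bmi = 28,
#                           children = 1, smoker = "no"))
```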
With the given problem, concepts, and objectives, various relations can be learned from the data. From the regression lines drawn between BMI and charges or between age and charges, it is clear that a simple regression on a single attribute cannot draw a conclusion from the data. From the Pearson and Spearman correlations we can also see that attributes such as age and bmi are only weakly correlated with charges.
From the results and the EDA, the data does seem to be in a normal range, as we can see from the histograms of charges and of log(charges) (a somewhat bell-shaped curve).
A log transformation of the charges appears to be a plausible change that can help the model predict better, but either additional data or additional attributes would be needed to improve the predictions further.
According to the current data analysis, linear regression is not an optimal method to find a solution for this problem.
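The log-transformation suggestion can be probed by refitting the same model on log(charges) and comparing adjusted R-squared values. Note the two values are computed on different response scales, so the comparison is only indicative; a sketch:

```r
# Fit the model on raw and on log-transformed charges and report both
# adjusted R-squared values (indicative only: different response scales).
compare_log_fit <- function(d) {
  m_raw <- lm(charges ~ age + sex + bmi + children + smoker, data = d)
  m_log <- lm(log(charges) ~ age + sex + bmi + children + smoker, data = d)
  c(raw = summary(m_raw)$adj.r.squared,
    log = summary(m_log)$adj.r.squared)
}
# compare_log_fit(df)
```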
nrow(df)
[1] 1338
There are 1338 data points in the data set, which may be enough to make a prediction.
The current attributes have very low correlation with the dependent attribute, so we may need additional independent attributes to make better predictions.
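Whether 1338 rows are enough can also be probed with a simple hold-out evaluation: fit the model on about 80% of the rows and measure the prediction error (RMSE) on the remaining 20%. A sketch with an assumed 80/20 split:

```r
# Hold-out evaluation: fit on a random ~80% of the rows,
# report root-mean-squared error on the held-out ~20%.
holdout_rmse <- function(d, frac = 0.8, seed = 1) {
  set.seed(seed)
  idx <- sample(nrow(d), size = floor(frac * nrow(d)))
  m <- lm(charges ~ age + sex + bmi + children + smoker, data = d[idx, ])
  pred <- predict(m, newdata = d[-idx, ])
  sqrt(mean((d$charges[-idx] - pred)^2))
}
# holdout_rmse(df) can be compared against the in-sample residual
# standard error of 6070 reported above
```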